Scraping data from WHO-DON
Load necessary packages
In [1]:
pacman::p_load(
  rio,
  here,
  httr,
  jsonlite,
  stringr,
  tidyverse
)
Write a function to get data from WHO-DON website
We use the GET method from the httr package to retrieve data from the WHO-DON API. The function get_who_news takes a URL as input and returns the parsed response; it returns NULL if the request was unsuccessful.
A status code of 200 means the request succeeded, in which case the function parses the JSON content and returns the data.
Next, we initialize the variables and loop to fetch all pages. On each iteration the loop checks whether the response contains a next-page link (@odata.nextLink) and keeps fetching until there is none.
In [2]:
# Function to get news data from a specific URL
get_who_news <- function(url) {
  # Make the GET request to the API
  response <- GET(url)

  # Check the status of the response
  if (status_code(response) == 200) {
    # Parse the JSON content
    content <- content(response, "text")
    data <- fromJSON(content)
    return(data)
  } else {
    # Return NULL if the request was unsuccessful
    return(NULL)
  }
}

# Initialize variables
base_url <- "https://www.who.int/api/news/diseaseoutbreaknews"
all_news <- list()
next_url <- base_url
keep_fetching <- TRUE

# Loop to fetch all pages
while (keep_fetching) {
  data <- get_who_news(next_url)

  if (!is.null(data) && "value" %in% names(data)) {
    all_news <- c(all_news, data$value)

    # Check if there is a next page link
    if ("@odata.nextLink" %in% names(data)) {
      next_url <- data[["@odata.nextLink"]]
    } else {
      keep_fetching <- FALSE
    }
  } else {
    keep_fetching <- FALSE
  }
}
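The pagination logic above can be tried out without touching the network. The following is a minimal sketch, not part of the original notebook: it replaces get_who_news with a hypothetical stub, fake_fetch, that serves two in-memory pages shaped like the API response (a value element plus an optional @odata.nextLink). The page keys and values are invented for illustration.

```r
# Hypothetical in-memory pages; keys and values are made up for illustration.
fake_pages <- list(
  "page1" = list(value = list("a", "b"), "@odata.nextLink" = "page2"),
  "page2" = list(value = list("c"))
)

# Stand-in for get_who_news(): returns the stored page, or NULL if unknown.
fake_fetch <- function(url) fake_pages[[url]]

collected <- list()
next_url <- "page1"
keep_fetching <- TRUE

while (keep_fetching) {
  page <- fake_fetch(next_url)

  if (!is.null(page) && "value" %in% names(page)) {
    collected <- c(collected, page$value)

    # Follow the next-page link if present, otherwise stop
    if ("@odata.nextLink" %in% names(page)) {
      next_url <- page[["@odata.nextLink"]]
    } else {
      keep_fetching <- FALSE
    }
  } else {
    keep_fetching <- FALSE
  }
}

length(collected)
```

The loop stops either when a page has no next-page link or when a fetch fails, so a failed request midway silently truncates the result; checking the number of collected records afterwards is a cheap safeguard.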
Convert data from list to wide dataframe
The data is currently stored as a nested list. We convert it to a wide data frame for further analysis by defining a function, convert_to_df, that takes the list as input and returns a data frame. The function walks the list in segments of 22 elements (segment_length), converts each segment to a data frame, and row-binds the results.
HTML tags are then removed from the text columns.
Finally, the data frame is sorted by publication date (newest first) and exported to a CSV file for further analysis.
In [3]:
# Define a function to convert the nested list to a data frame
convert_to_df <- function(news_list) {
  # Initialize an empty list to hold data frames
  df_list <- list()

  # Determine the segment length
  segment_length <- 22

  # Iterate through the list in steps of segment_length
  for (i in seq(1, length(news_list), by = segment_length)) {
    # Extract the current segment
    segment <- news_list[i:(i + segment_length - 1)]

    # Convert the segment to a data frame and add it to the list
    df_list[[length(df_list) + 1]] <- as.data.frame(segment)
  }

  # Combine all data frames into one
  combined_df <- do.call(rbind, df_list)
  return(combined_df)
}

# Function to remove HTML tags
remove_html_tags <- function(text) {
  return(str_replace_all(text, "<[^>]*>", ""))
}

all_news_df <- convert_to_df(all_news)

all_news_df %>%
  mutate(across(where(is.character), remove_html_tags)) %>%
  arrange(desc(PublicationDate)) %>%
  rio::export(here("data", "who_dons.csv"))
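To see why the list is processed in fixed-size segments, here is a small self-contained sketch under one assumption: each page's value element arrives as a data frame (jsonlite's default simplification of an array of records), so c() flattens every page into its columns. The two toy pages, their column names, and the segment length of 2 are invented for illustration; the real data uses 22 columns per page.

```r
library(dplyr)
library(stringr)

# Two made-up "pages", each a small data frame with the same two columns.
page1 <- data.frame(
  Title = c("<p>Cholera</p>", "Mpox"),
  DonId = c("DON1", "DON2"),
  stringsAsFactors = FALSE
)
page2 <- data.frame(
  Title = "<b>Measles</b> update",
  DonId = "DON3",
  stringsAsFactors = FALSE
)

# c() on a list and data frames concatenates their columns into one flat list:
# Title, DonId, Title, DonId
flat <- c(list(), page1, page2)

segment_length <- 2  # columns per page in this toy example
starts <- seq(1, length(flat), by = segment_length)

# Rebuild one data frame per page, then stack them
pieces <- lapply(starts, function(i) {
  as.data.frame(flat[i:(i + segment_length - 1)], stringsAsFactors = FALSE)
})
combined <- do.call(rbind, pieces)

# Strip HTML tags from every character column
cleaned <- combined %>%
  mutate(across(where(is.character), ~ str_replace_all(.x, "<[^>]*>", "")))

cleaned
```

The regular expression only removes tags; HTML entities such as &nbsp; or &amp; would remain in the text and need a separate step if they matter.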
Pivot data into long format
In [4]:
corpus <- import(here("data", "who_dons.csv")) %>%
  select(
    UrlName,
    DonId,
    # Text sections, ordered by importance
    Summary,
    Overview,
    Epidemiology,
    Assessment,
    Advice,
    FurtherInformation
  ) %>%
  # Convert empty strings in character columns to NA
  mutate(across(where(is.character), ~na_if(., ""))) %>%
  # Standardize the DonId column, with the format Year-DON-Number;
  # the UrlName column is used if DonId is missing
  mutate(DonID_standardized = coalesce(DonId, UrlName)) %>%
  # Relocate the DonID_standardized column to the first position
  relocate(DonID_standardized, .before = UrlName) %>%
  # Pivot longer, so the text from columns Summary to FurtherInformation ends up in one column
  pivot_longer(
    cols = Summary:FurtherInformation,
    names_to = "InformationType",
    values_to = "Text")

export(corpus, here("data", "corpus.csv"))
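The same cleaning and reshaping steps can be followed on a toy table. The sketch below is illustrative only: the two rows and the reduced set of text columns (Summary and Advice) are made up, and the second row deliberately has a missing DonId and an empty Advice field.

```r
library(dplyr)
library(tidyr)

# Made-up example rows; only two text columns are kept for brevity.
toy <- tibble(
  UrlName = c("2024-DON525", "2024-DON524"),
  DonId   = c("2024-DON525", NA),
  Summary = c("Summary text A", "Summary text B"),
  Advice  = c("Advice text A", "")
)

toy_long <- toy %>%
  # Empty strings become NA
  mutate(across(where(is.character), ~ na_if(., ""))) %>%
  # Fall back to UrlName where DonId is missing
  mutate(DonID_standardized = coalesce(DonId, UrlName)) %>%
  relocate(DonID_standardized, .before = UrlName) %>%
  # One row per document and information type
  pivot_longer(
    cols = Summary:Advice,
    names_to = "InformationType",
    values_to = "Text"
  )

toy_long
```

Each input row yields one output row per text column, so the long table has (number of documents) x (number of text columns) rows, with NA in Text wherever a section was empty.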
Preview data
In [5]:
all_news_df <- import(here("data", "who_dons.csv"))
Data up to 15 July 2024
In [6]:
glimpse(corpus)
Rows: 18,648
Columns: 5
$ DonID_standardized <chr> "2024-DON525", "2024-DON525", "2024-DON525", "2024-…
$ UrlName <chr> "2024-DON525", "2024-DON525", "2024-DON525", "2024-…
$ DonId <chr> "2024-DON525", "2024-DON525", "2024-DON525", "2024-…
$ InformationType <chr> "Summary", "Overview", "Epidemiology", "Assessment"…
$ Text <chr> "The International Health Regulations (IHR) Nationa…